Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge or insights from data in various forms, whether structured or unstructured.
The three components involved in data science are organising, packaging and delivering data.
The Three-Step OPD (Organise, Package, Deliver) Data Science Process
Organising data involves the physical storage and format of data, and incorporates best practices in data management.
Packaging data involves logically manipulating and joining the underlying raw data into a new representation and package.
Delivering data involves ensuring that the message the data carries reaches those who need to hear it.
In Machine Learning, instead of writing explicit code, you feed data to a generic algorithm and it builds its own logic from that data.
There are many different types of Machine Learning systems, so it is useful to classify them in broad categories based on the amount of supervision they receive during training, whether or not they can learn incrementally from a stream of incoming data, and how they generalize.
Supervised learning is where you have input variables (x) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output. The goal is to approximate the mapping function so well that, when you have new input data (x), you can predict the output variable (Y) for that data.
A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
A regression problem is when the output variable is a real value, such as an amount in rupees or a weight.
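As a minimal sketch of the supervised setting, consider a one-variable regression: the examples below are invented toy data, and the algorithm fits a line y = a*x + b by least squares, then predicts the output for an unseen input.

```python
# Supervised learning sketch: learn a mapping from x to Y on toy data,
# then predict Y for a new x. No external libraries needed.

xs = [1.0, 2.0, 3.0, 4.0]   # input variable (x)
ys = [2.1, 4.0, 6.2, 7.9]   # output variable (Y)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept for y = a*x + b
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(x):
    """Predict the output for a new, unseen input."""
    return a * x + b

print(predict(5.0))
```

The same two-step pattern (fit on known (x, Y) pairs, then predict for new x) carries over to classification; only the type of output changes from a real value to a category.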
In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.
Some photo-hosting services, such as Google Photos, are good examples of this.
Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return.
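The observe/act/reward loop can be illustrated with a toy two-action problem (the action names and reward values below are invented for illustration): the agent mostly exploits the action with the best value estimate, occasionally explores, and updates a running average of the rewards it observes.

```python
# Reinforcement-learning sketch: an agent repeatedly selects an action,
# receives a reward, and refines a value estimate for each action.
import random

random.seed(0)                      # fixed seed for reproducibility
rewards = {"a": 1.0, "b": 0.2}      # hidden reward of each action (invented)
value = {"a": 0.0, "b": 0.0}        # the agent's current estimates
counts = {"a": 0, "b": 0}

for step in range(100):
    # Explore 10% of the time; otherwise exploit the best-known action
    if random.random() < 0.1:
        action = random.choice(["a", "b"])
    else:
        action = max(value, key=value.get)
    r = rewards[action]                                     # observe reward
    counts[action] += 1
    value[action] += (r - value[action]) / counts[action]   # running mean

best = max(value, key=value.get)
print(best, value)
```

Real reinforcement-learning problems add state: the environment changes in response to actions, and the agent must learn a policy mapping observations to actions rather than a single best action.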
Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data.
In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
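The contrast can be sketched with the simplest possible "model", a mean: the batch version needs all the data up front, while the online version updates one instance at a time and can keep learning as new data arrives.

```python
# Batch vs. online learning sketch on toy data.

data = [2.0, 4.0, 6.0, 8.0]

# Batch: requires all available data at once
batch_mean = sum(data) / len(data)

# Online: incremental update, one instance at a time;
# more data could keep arriving and the estimate would keep adapting
online_mean, n = 0.0, 0
for x in data:
    n += 1
    online_mean += (x - online_mean) / n   # running-mean update

print(batch_mean, online_mean)  # both 5.0
```

The online version never stores the full dataset, which is exactly why online learning suits systems that receive data as a continuous stream or that cannot fit all the data in memory.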
One more way to categorize Machine Learning systems is by how they generalize.
The system learns the examples by heart, then generalizes to new cases using a similarity measure. This is called instance-based learning.
Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning.
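The two generalization styles can be contrasted on the same invented toy data: the instance-based predictor memorizes the examples and answers with the most similar one, while the model-based predictor fits a simple model (here, a line through the origin) and predicts from it.

```python
# Instance-based vs. model-based generalization on toy (x, y) examples.

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def predict_instance(x):
    """Instance-based: answer with the y of the most similar stored example."""
    nearest = min(examples, key=lambda e: abs(e[0] - x))
    return nearest[1]

# Model-based: assume y = a*x and estimate a from the examples
a = sum(x * y for x, y in examples) / sum(x * x for x, y in examples)

def predict_model(x):
    """Model-based: predict from the fitted model, not the raw examples."""
    return a * x

print(predict_instance(2.4), predict_model(2.4))
```

Note the difference for an input between two examples: the instance-based predictor snaps to the nearest memorized output, while the model-based predictor interpolates smoothly.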
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. [4]
RStudio is open-source, enterprise-ready professional software for working with R. [5]
Python is a general-purpose, interpreted, dynamically typed, interactive, object-oriented, high-level programming language. [6] There are two main versions of Python in current use, Python 2 and Python 3.
Anaconda is a freemium open-source distribution of the Python and R programming languages. [7] It is used for large-scale data processing, predictive analytics, and scientific computing, and aims to simplify package management and deployment.
Its package management system is conda.
PyCharm is an Integrated Development Environment (IDE) used in computer programming, specifically for the Python language.[8]
PyCharm is cross-platform, with Windows, macOS and Linux versions.
Atom is a free and open-source text and source-code editor developed by GitHub. [9] It is available for macOS, Linux, and Microsoft Windows, with support for plug-ins written in Node.js and embedded Git control.
The shell is a program that takes your commands from the keyboard and gives them to the operating system to perform. It is an environment in which you can run commands, programs, and shell scripts; in other words, a text-based means of interacting with the computer.
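A brief session illustrates the idea: each line typed at the prompt names a program, the shell runs it, and the program's output comes back as text (it can also be captured into a variable).

```shell
# Each line is a command: the shell runs the named program and
# returns its output as text.
pwd                       # print the current working directory
greeting=$(echo "hello")  # run echo and capture its output in a variable
echo "$greeting"          # prints: hello
```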
PuTTY is a client program for the SSH, Telnet and Rlogin network protocols. These protocols are all used to run a remote session on a computer, over a network. [10]
Dask is a flexible parallel computing library for analytic computing. [11]
TensorFlow™ is an open source software library for numerical computation using data flow graphs. [12]
PyTorch is a Python package that provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system. [13]
Apache Spark is an open-source cluster-computing framework. [14]
It is a fast statistical, machine learning, and math runtime for big data. [15]
An open-source, distributed deep learning library for the JVM. [16]
Watson is a question answering computer system capable of answering questions posed in natural language, developed by IBM. [17]
Apache Cassandra is a free and open-source distributed NoSQL database management system. [18] It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.